Essence: A Portable Methodology for Acquiring Information Extraction Patterns

Authors

  • Neus Català
  • Núria Castell
  • Mario Martín
Abstract

An important issue in constructing Information Extraction (IE) systems is how to obtain the knowledge needed to identify relevant information in a document. In most approaches, a human expert must intervene at many steps of the acquisition process. In this paper we describe ESSENCE, a new methodology that significantly reduces the need for human intervention. It is based on ELA, a new algorithm for acquiring information extraction patterns. The distinctive features of ESSENCE and ELA are that 1) they automatically acquire IE patterns from an unrestricted text corpus representative of the domain, thanks to 2) their ability to identify regularities in the context surrounding concept-words that are semantically relevant to the IE task, using non-domain-specific lexical knowledge tools and semantic relations from WordNet, and 3) they restrict human intervention to defining the task and validating and typifying the set of IE patterns obtained. The use of a general-purpose ontology and general-purpose syntactic tools makes the methodology easy to port and reduces expert effort. Results of the application of this methodology for acquiring extraction patterns in a MUC-like task are also


Similar resources

Extraction of Drug Crime Patterns and Identifying People at Risk Using Data Mining Techniques

Introduction: In recent years, technological advancement and the growth of information technology in organizations have produced a huge store of data on drug-related offenses. Analyzing these data and discovering the hidden patterns in them can help detect and prevent crimes in this area. This paper aimed to identify people susceptible to drug trafficking in Si...


Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we rely mainly on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...


Cross-lingual Information Extraction System Evaluation

In this paper, we discuss the performance of cross-lingual information extraction systems employing an automatic pattern acquisition module. This module, which creates extraction patterns starting from a user’s narrative task description, allows rapid customization to new extraction tasks. We compare two approaches: (1) acquiring patterns in the source language, performing source language extrac...


Learning Web Query Patterns for Imitating Wikipedia Articles

This paper presents a novel method for acquiring a set of query patterns to retrieve documents containing important information about an entity. Given an existing Wikipedia category that contains the target entity, we extract and select a small set of query patterns by presuming that formulating search queries with these patterns optimizes the overall precision and coverage of the returned Web ...




Publication date: 2000